Exploratory Data Analysis

Exploration

Summary Statistics

The following is a summary of the data.

Variable              Min.  1st Qu.   Median     Mean  3rd Qu.     Max.  NA's
TARGET_WINS           0.00    71.00    82.00    80.79    92.00   146.00     0
TEAM_BATTING_H         891     1383     1454     1469     1537     2554     0
TEAM_BATTING_2B       69.0    208.0    238.0    241.2    273.0    458.0     0
TEAM_BATTING_3B       0.00    34.00    47.00    55.25    72.00   223.00     0
TEAM_BATTING_HR       0.00    42.00   102.00    99.61   147.00   264.00     0
TEAM_BATTING_BB        0.0    451.0    512.0    501.6    580.0    878.0     0
TEAM_BATTING_SO        0.0    548.0    750.0    735.6    930.0   1399.0   102
TEAM_BASERUN_SB        0.0     66.0    101.0    124.8    156.0    697.0   131
TEAM_BASERUN_CS        0.0     38.0     49.0     52.8     62.0    201.0   772
TEAM_BATTING_HBP     29.00    50.50    58.00    59.36    67.00    95.00  2085
TEAM_PITCHING_H       1137     1419     1518     1779     1682    30132     0
TEAM_PITCHING_HR       0.0     50.0    107.0    105.7    150.0    343.0     0
TEAM_PITCHING_BB       0.0    476.0    536.5    553.0    611.0   3645.0     0
TEAM_PITCHING_SO       0.0    615.0    813.5    817.7    968.0  19278.0   102
TEAM_FIELDING_E       65.0    127.0    159.0    246.5    249.2   1898.0     0
TEAM_FIELDING_DP      52.0    131.0    149.0    146.4    164.0    228.0   286
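A summary like the one above, together with the per-column count of missing values, can be produced roughly as follows. This is a minimal sketch: the small `training` data frame is made-up toy data standing in for the real data set, which is not reproduced here.

```r
# Toy stand-in for the training data; the values are made up for illustration.
training <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 64),
  TEAM_BATTING_SO  = c(578, NA, 909, NA),
  TEAM_FIELDING_DP = c(150, 140, NA, 160)
)

# Min, quartiles, median, mean, max (and the NA count, if any) for every column.
print(summary(training))

# Number of missing values per column, as a named numeric vector.
missing_counts <- colSums(is.na(training))
print(missing_counts)
```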

Plots

The following density plots show the spread of the data. The red vertical line is the mean and the blue vertical line is the median. The scatter plot shows the relationship between wins and each variable.
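A density panel of this kind can be sketched in base R as shown here. The helper name `plot_density_with_centers` and the synthetic values are assumptions made for illustration. The original plots may have been drawn with a different plotting library.

```r
# Synthetic values standing in for one variable; not the real data.
set.seed(1)
x <- c(rnorm(200, mean = 80, sd = 12), NA, NA)

# Hypothetical helper: density curve with the mean (red) and median (blue) marked.
plot_density_with_centers <- function(x, main = "") {
  x <- x[!is.na(x)]                               # density() does not accept NA
  centers <- c(mean = mean(x), median = median(x))
  plot(density(x), main = main)
  abline(v = centers[["mean"]], col = "red")      # mean as a red vertical line
  abline(v = centers[["median"]], col = "blue")   # median as a blue vertical line
  invisible(centers)                              # return both values for inspection
}

centers <- plot_density_with_centers(x, main = "TARGET_WINS (synthetic)")
```

When the two lines sit far apart, the distribution is skewed, which is worth knowing before imputing with the mean.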

Missing Data

Batting Strike Outs

To fill the 102 missing values, we alternate between the two modes of the distribution (578 and 909).
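One way to alternate between the two modes is to cycle through them in the order the missing values appear. The sketch below assumes that ordering. The helper name `impute_alternating` and the short example vector are hypothetical.

```r
# Hypothetical helper: replace NAs by cycling through the supplied fill values.
impute_alternating <- function(x, fill_values) {
  missing_idx <- which(is.na(x))
  # rep(..., length.out = n) cycles the fill values: 578, 909, 578, ...
  x[missing_idx] <- rep(fill_values, length.out = length(missing_idx))
  x
}

# Made-up strikeout counts with three gaps.
so <- c(600, NA, 850, NA, NA, 720)
filled <- impute_alternating(so, c(578, 909))
print(filled)
```

Compared with filling every gap with a single value, this keeps the two-peaked shape of the variable roughly intact.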

Pitching Strike Outs

We fill the 102 missing values with the mean.

Double Plays

We fill the 286 missing values with the mean.
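Mean imputation for the pitching strikeout and double play columns can be sketched as follows. The helper name `impute_mean` and the toy data frame are assumptions for illustration, not the original code.

```r
# Hypothetical helper: replace NAs with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data with one gap in each column.
training <- data.frame(
  TEAM_PITCHING_SO = c(800, NA, 900, 700),
  TEAM_FIELDING_DP = c(150, 140, NA, 160)
)

# Apply the same imputation to both columns.
cols <- c("TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
training[cols] <- lapply(training[cols], impute_mean)
print(training)
```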

Scaled and Combined

The idea behind this model is that teams that are better than average will win more games and teams that are worse than average will win fewer. We determine whether a team is better than average by looking at how well it performs at batting, pitching, and fielding.

Since there is more than one way to win a baseball game (for example, relying on power sluggers who hit home runs versus relying on consistent singles hitters), we need to combine the various batting measures. Because striking out at bat is bad, we flip the sign of that variable. It can then be combined with the others and still fit the premise that better teams win more games and worse teams win fewer.

We scale all variables, which centers them at 0 and gives them a standard deviation of 1. We can then combine almost all of the batting variables into one measure (hit by pitch is excluded).
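The scaling and combining step might look like the sketch below. The text does not say how the scaled variables are combined, so summing the z-scores is an assumption here, and the three-column toy data frame is invented for illustration.

```r
# Toy batting data; the values are made up.
training <- data.frame(
  TEAM_BATTING_H  = c(1400, 1500, 1450, 1600),
  TEAM_BATTING_HR = c(90, 120, 100, 150),
  TEAM_BATTING_SO = c(900, 700, 800, 600)
)

# scale() centers each column at 0 and rescales it to standard deviation 1.
z <- as.data.frame(scale(training))

# Strikeouts hurt the batting side, so flip the sign before combining.
z$TEAM_BATTING_SO <- -z$TEAM_BATTING_SO

# Assumption: the combined measure is the row sum of the z-scores.
TEAM_BATTING <- rowSums(z)
print(TEAM_BATTING)
```

With this sign convention, a positive `TEAM_BATTING` value means better-than-average batting and a negative value means worse.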


Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING + TEAM_PITCHING + TEAM_FIELDING, 
    data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.578  -8.197   0.281   8.526  49.155 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    80.8822     0.2885  280.33   <2e-16 ***
TEAM_BATTING    3.6415     0.1897   19.20   <2e-16 ***
TEAM_PITCHING   3.0643     0.2174   14.10   <2e-16 ***
TEAM_FIELDING  -0.1567     0.2062   -0.76    0.447    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.35 on 1986 degrees of freedom
  (286 observations deleted due to missingness)
Multiple R-squared:  0.2169,    Adjusted R-squared:  0.2157 
F-statistic: 183.3 on 3 and 1986 DF,  p-value: < 2.2e-16

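This model says that an average baseball team will win about 81 games. A team whose batting is one standard deviation better than average is expected to win about 4 more games, and a team whose pitching is one standard deviation better than average is expected to win about 3 more. Better-than-average fielding adds essentially nothing: its coefficient (-0.16) is not statistically significant (p = 0.447).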
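The arithmetic behind this reading can be checked by plugging values into the fitted equation. The sketch below copies the coefficients from the regression output. The function name `predict_wins` is hypothetical, and its inputs are in units of the combined measures.

```r
# Coefficients copied from the regression output.
coefs <- c(intercept = 80.8822, batting = 3.6415,
           pitching = 3.0643, fielding = -0.1567)

# Hypothetical helper: expected wins for given batting, pitching, fielding scores.
predict_wins <- function(batting, pitching, fielding) {
  coefs[["intercept"]] +
    coefs[["batting"]]  * batting +
    coefs[["pitching"]] * pitching +
    coefs[["fielding"]] * fielding
}

average_team   <- predict_wins(0, 0, 0)  # every measure at its average
better_batting <- predict_wins(1, 0, 0)  # batting one unit above average
print(c(average = average_team, better_batting = better_batting))
```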

Critical Thinking Group 3

2019-09-16